Efficient Cosmological Parameter Estimation with Hamiltonian Monte Carlo 
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Traditional Markov Chain Monte Carlo methods suffer from low acceptance rate, slow mixing 
and low efficiency in high dimensions. Hamiltonian Monte Carlo resolves this issue by avoiding 
the random walk. Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) 
technique built upon the basic principle of Hamiltonian mechanics. Hamiltonian dynamics allows 
the chain to move along trajectories of constant energy, taking large jumps in the parameter space 
with relatively inexpensive computations. This new technique improves the acceptance rate by a 
factor of 4 while reducing the correlations and boosts up the efficiency by at least a factor of D in 
a D-dimensional parameter space. Therefor shorter chains will be needed for a reliable parameter 
estimation comparing to a traditional MCMC chain yielding the same performance. Besides that, 
the HMC is well suited for sampling from non-Gaussian and curved distributions which are very 
hard to sample from using the traditional MCMC methods. The method is very simple to code and 
6J[)' can be easily plugged into standard parameter estimation codes such as CosmoMC. In this paper we 

^ demonstrate how the HMC can be efficiently used in cosmological parameter estimation. 
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I. INTRODUCTION 

> 

■ The wealth of information in the high quality data of the recent Wilkinson Microwave Anisotropy Probe (WMAP) 
\ satellite experiment combined with other recent advances in observational cosmology has led the field of cosmology 
\& into the new era of precision cosmology Q , @ ■ These information are used to constrain the cosmological models, by 
putting estimates and error bars on various quantitative parameters of the cosmological models. Markov Chain Monte 
Carlo methods are the most popular techniques of parameter estimation in cosmology. These methods first introduced 
in the astrophysical context by 3]. Since then MCMC have been used in several problems in cosmology. Standard 

£pH| planet data. 

The MCMC methods are much faster than the grid-based ones but even with the currently available data a naive 
MCMC algorithm takes a long time to converge and estimating cosmological parameters specially those outside the 

r/2 . LCDM is a challenging task. This problem is much more important for the future data of the upcoming experiments 
like ACT [H and Planck that will map the CMB sky on small scales (large I). It seems necessary to design new 

^ , methods and techniques to speed up the parameter estimation in a more efficient way. This can be done in two ways 



packages were made publicly available for cosmological parameter estimation including CosmoMC Analyze This! 
[j| . MCMC techniques have also been applied to the gravity wave data analysis 0] , 0] and to modelling extrasolar 



The first method is by speeding up the power spectrum and likelihood calculations. Interpolation based methods 



such as CMBWarp 8], Pico |!| and CosmoNet [lfj are essentially fast methods of computing the power spectrum 
and likelihood from a pre-computed points on a grid. Also 11] proposed a method to speed up the power 
spectrum calculation and a similar method is being used in the current version of CAMB. These methods help 
MCMC steps to be taken faster. 

• The second class of methods which can be used in parallel with the first one improve the efficiency of the MCMC 
samplers. Therefor shorter MCMC chains will be needed for a reliable parameter estimation. Examples of these 
methods are COG sampler 0] > Metropolis sampler with optimal step-size 0] and the slice sampler in the 
latest version of CosmoMC. In this paper, we apply Hamlitonian methods to cosmological problems. As we will 
show, these methods significantly improve the efficency of the MCMC samplers. 

Hamiltonian Monte Carlo (HMC) as first introduced by is a Markov chain Monte Carlo (MCMC) technique 
built upon the basic principle of Hamiltonian mechanics. Its potential in Bayesian computation makes it a very 
effective means for exploring complex posterior distributions. Hamiltonian Monte Carlo is a MCMC technique in 
which a momentum variable is introduced for each parameter of the target probability distribution function (pdf). In 
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analogy to a physical system, a Hamiltonian H is denned as a kinetic energy involving the momenta plus a potential 
energy U , where U is minus the logarithm of the target density. Hamiltonian dynamics allows one to move along 
trajectories of constant energy, taking large jumps in the parameter space with relatively few evaluations of U and 
its gradient. The Hamiltonian algorithm alternates between picking a new momentum vector and following such 
trajectories. The efficiency of the Hamiltonian method for multidimensional isotropic Gaussian pdf's is shown to 
remain constant even at high dimensions. Almost all of the traditional MCMC algorithms suffer from the inefficiency 
caused by the random walk nature of the Metropolis algorithm. The HMC algorithms and some variants on slice 
sampling can be used to prevent the walker from doubling back and hence to obtain faster convergence and better 
efficiency. 



II. BAYESIAN INFERENCE WITH MARKOV CHAIN MONTE CARLO 



Bayesian inference in brief is a statistical method by which unknown properties of a system may be inferred from 
observation. In Bayesian inference method we model the system of interest by some model with parameters {xi}. 
We also introduce a likelihood function to quantify the probability of the observed data, given the above model, 
f(data\{xi}). The Bayesian approach to statistical inference then states that 

n({xi}\data) oc f(data\{xi})p({xi}). (1) 

The distribution ir{{xi}\data) is referred to as the posterior distribution and represents our beliefs about the model 
parameters after having observed the data. The distribution p({xi}) is referred to as the prior and is concerned with 
our original beliefs about the model parameters, {xi} before observing the data. Thus, Bayes' theorem allows us to 
update our prior beliefs on the basis of the observed data to obtain our posterior beliefs, which form the basis for our 
inference. 

Often we are interested in the marginalized distributions of the parameters 

(x k \data) — J x k ir({xi}\data)dxidx2 ■ ■ ■ dx D} (2) 

where D is the number of dimensions of the parameter space. At large D, the above integral becomes very hard (and 
very quickly impossible) to do. The Monte Carlo principle helps us to carry out such integrals. 



A. Monte Carlo Principle 

The basic idea of Monte Carlo simulation is to draw an independent and identically-distributed set of samples 
Xi = x\,X2, ■ ■ ■ ,xn from a target density p(x). The target density can be approximated by these samples as 

1 N 

Pn{x) — — &(x,xi) (3) 

i=\ 

where £( x ,xj) is the Kronecker delta function. Consequently, samples from the chain can be used to approximate the 
integrals with tractable sums. An integral 



f(x)p(x)dx (4) 



can be approximated by averaging the function over the chain 

N 

V 



M/) = iE/(*i)- ( 5 ) 



Often the posterior mean, x, and the variance, (x — x) 2 , are of interest. The can be obtained using f(x) — x and 
f(x) = {x — x) 2 , respectively. Drawing independent samples from p(x) is straight forward when it has a standard 
form (for example Gaussian). But in most of the problems, this is not the case and we have to use more sophisticated 
techniques such as Markov Chain Monte Carlo (MCMC). The MCMC technique has opened up the possibility of 
applying Bayesian analysis to complex analysis problems. 
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B. Markov Chain Monte Carlo Methods 



Markov Chain Monte Carlo is a Markov chain|32( that is constructed to generate samples from a desired target 
density, p(x). 

The most popular MCMC algorithm is the Metropolis- Hastings algorithm. In this algorithm one takes a trial step, 
x* given the current state, x, from an easy-to-sample proposal distribution q(x*\x). This trial step is then accepted 
with the probability of 



mm{l, } 
p(x)q(x\x*) 



(6) 



otherwise it remains at x. A simple instance of this algorithm is the Metropolis algorithm that uses a symmetric 
proposal distribution 



and hence the acceptance ratio simplifies to 



q(x*\x) = q(x\x*) 
p(x*)- 



i{l, 



p(x) 



} 



(7) 



(8) 



If q(x*\x) — f(Ax = x — x*) for some arbitrary function /, then the kernel driving the chain is a random walk and the 
proposed step is of the form x* — x + Ax, where Ax ~ /. There are many common choices for / including the uniform 
distribution on the unit disk, multivariate normal or a i-distribution. All of these distributions are symmetric, hence 
the acceptance probability has the simple form of eqn.© The algorithm below shows how this is done in practice 



initialize xq 
for i = 1 to N. 



Random Walk Metropolis 
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10 : end for 



steps 

sample Ax from the proposal distribution: Ax ~ q(Ax\x) 
x* = x + Ax 
draw a ~ Uniform(0, 1) 

ifa<min{l,Hg)} 



x% 



else 



In order for this method to be effective, samples must be as uncorrelated and independent as possible. A common 
problem of MCMC techniques is that samples can be highly correlated. This makes the method inefficient. In this 
case although the samples are drawn from the correct distribution, they sample the target density very slowly. And 
a huge number of samples might be needed for reliable estimates. 



C. Hamiltonian Monte Carlo 



Hamiltonian Monte Carlo (HMC) belongs to the class of MCMC algorithms with auxiliary variable samplers |33|. 
HMC is a MCMC technique in which a momentum variable is introduced for each parameter of the target probability 
distribution function (pdf). In analogy to a physical system, a Hamiltonian H is defined as a kinetic energy involving 
the momenta plus a potential energy U , where U is minus the logarithm of the target density. Hamiltonian dynamics 
allows one to move along trajectories of constant energy, taking large jumps in the parameter space with relatively 
few evaluations of U and its gradient. The Hamiltonian algorithm alternates between picking a new momentum 
vector and following such trajectories. This algorithm is designed to improve mixing in high dimensions by getting 
more uncorrelated and independent samples. To explain this in more detail, suppose we wish to sample from the 
distribution p(x), where x G 1Z D (1Z D is the D-dimensional parameter space of our problem). We augment each 
parameter xt with an auxiliary conjugate momentum m, and define the potential energy 



C/(x) = -lnp(x), 



(9) 
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and the Hamiltonian 



H(x,u) = U(x)+K(u), 



where K(u) = u r u/2 is the kinetic energy. This is used to create samples from the extended target density 



p(x, u) oc exp(— H(x, u)) 

= exp(— U(x)) exp(- 
= p(x)AA(u;0,l), 



(10) 



(11) 



■K(u)) 



where 7V"(u; 0, 1) is the D-dimensional normal distribution with zero mean and unit variance. This density is separable, 
so the marginal distribution of x is the desired distribution p(x). This means if we can sample (x, u) from the extended 
distribution p(x, u), then the marginal distribution of x is exactly the target distribution p(x). 

Each step in HMC consists of drawing a new pair of samples (x, u) according to p(x, u) starting from the current 
value of x and generating a Gaussian random variable u. These arc our initial conditions. The time evolution of this 
system is then governed by Hamiltonian equations of motion 



X i 



(12) 



Ui = - 



dH 



Because the Hamiltonian dynamics is time-reversible, volume-preserving and total energy preserving, if the dynamics 
are simulated exactly, the resulting moves will leave the extended target density, p(x, u), invariant. That is, if we 
start from (x( ),U( )) ~ p, then after the system evolves for time t, the new configuration at time t, (x( t ),U( t )) also 
follows the distribution p. But in practice the dynamics are simulated with a finite step-size, and as a result, H won't 
remain invariant. The Hamiltonian dynamics in practice is simulated by the leapfrog algorithm with a small step-size 



e (dU_ 

2 \dxi 



(13) 



Xi(t + e) = Xi{t) + eui{t + -) 



«i(*+2) 



i(t) 



e (dU_ 

2 \dxi 



Each leapfrog move is volume-preserving and time-reversible. But it does not keep the H constant. The effect of 
inexact simulation introduced by the non-zero step-size can be eliminated by the Metropolis rule: the point reached 
by following the dynamics is accepted with the probability of 



iin{l, e 



-(H(x*,u*)~H(x,u)) 



}■ 



(14) 



The algorithm of HMC is shown below 



Hamiltonian Monte Carlo 
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initialize x, 



(o) 



for i = 1 to N samp i es 
u~ Af(0,l) 

( X (0)' U (0)) = ( x (*-i), u ) 
for j = 1 to N 

make a leapfrog move: {■x* u _ 1) ,u* {j _ 1) ) -» (x^u^) 

end for 

(x*,u*) = (x (JV ),u ( jv)) 
draw a ~ Uniform(0, 1) 
if a < min{l, e -(H(x*,u*)-H(x,u))| 

x (i ) = x* 

else 

x (i) = x (i _ 1} 

end for 
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To see how this method works, let's consider the simplest target distribution: a one dimensional Gaussian distri- 
bution. The Hamiltonian for this case is that of a harmonic oscillator 

2 ? 
T 11 

ff^u) = -_ + !-. (15) 

The phase space trajectories of this system are ellipses with their size(area) proportional to H . In practice, a HMC 
algorithm would run along such an ellipse while performing leapfrog moves. If the algorithm consisted of only this 
leapfrog update, it would not be irreducible. Because all points generated from a starting point will never leave an 
ellipse of constant H and hence will not visit all points in the parameter space. The Gibbs update of the momentum 
(line 3 of the HMC algorithm shown above) rectifies this situation. At the beginning of every iteration the momentum 
is updated in such a way that that the Markov chain gets a chance of reaching all other values of H and hence visiting 
every point in the parameter space after enough number of iterations. 

In most practical cases (and almost always in astrophysics), we don't know the analytical form of the target 
distribution, j?(x). Therefor one can not derive the gradient of the logarithm of the p(x) in a closed form. Usually 
computing the p(x) is expensive enough that makes it impossible to compute the gradient numerically either. In this 
case, one performs an exploratory run of a simple MCMC method to get an idea of the p(x). This exploratory run 
can be used to fit a function to the minus logarithm of target density, t/(x). This is like estimating the proposal 
distribution in traditional MCMC algorithms except that we are no more limited to "easy-to-sample" distributions. 
Gradients of this proposal distribution can be used as estimates of gradients of the actual likelihood. This is explained 
in a cosmological context in Section IVI 

There are two parameters to tune while working with the HMC: step-size e and the number of leapfrog steps to be 
taken. As it was mentioned above, e should be chosen small enough to do the Hamiltonian dynamics correctly. Big 
step-sizes will bring errors into the simulation and won't keep the total energy constant. Hence the acceptance rate 
decreases. In the simplest case, e.g. eqn. 1151 it can be immediately seen that a reasonable e should be smaller than 
the "natural" time-scale of the system which is 

c<(|fr 1/2 - (16) 

The number of leapfrog steps, N, should be large enough to take the walker far from the starting point. A choice of 
Ne = do has worked well for most of the problems we deal with. It helps if the Ne is randomized in order to avoid the 
resonance condition poj , and it also improves the efficiency. However the choice of these two parameters may vary 
from a problem to another. The best choice would be obtained by monitoring the acceptance rate and the efficiency 
of the resulting chains. 

The HMC resolves some inefficiencies of the traditional MCMC methods such as low acceptance rate, high correla- 
tions between the consequent steps, slow mixing and slow convergence. To be able to compare different methods, we 
need a set of reliable diagnostic tools. The next section introduces some of these tools and they will be used in the 
rest of this paper to compare different techniques. For more details on the HMC technique see references 0, [l9j. 
IH, E3, m, M, and M- 



III. CONVERGENCE DIAGNOSTICS AND EFFICIENCY 



While working with MCMC algorithms there are a number of important questions that one should answer. These 
questions are Are we sampling from the distribution of interest? Are we are seeing the full distribution? Are we able 
to estimate parameters of interest to a desired precision? And are we doing all this efficiently? In order to answer 
the above questions, several diagnostic tests have been developed so far. In this paper we will use four methods to 
quantify the convergence and efficiency of MCMC algorithms. 

Autocorrelation: To examine relationships of successive steps of a chain, Xj,the autocorrelation function (ACF) can 
be used. The autocorrelation function at lag I is 

YJlZl{ x i-x){x i+ l -X) 



An ideal sampler with no correlations between successive elements of the chain will have an autocorrelation that 
quickly vanishes. High autocorrelations within chains indicate slow mixing and, usually, slow convergence. It could 
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be useful to thin out a chain with high autocorrelations before calculating summary statistics: a thinned chain may 
contain most of the information, but take up less space in memory. 
Autocorrelation Length: The autocorrelation length of a chain is defined as 

Imam 

L = 1 + 2 J^p(l) (18) 

i=i 

Where l max is the maximum lag. It is good to stop summing at a reasonable l max beyond which the autocorrelation 
becomes noisy to avoid introducing errors into the autocorrelation length calculation. An ideal sampler will have 
L = 1, and larger L would indicate more correlation between the data. 
Power Spectrum: The power spectrum of an infinite chain is defined as 

(X(k)X*(k')) = 5(k - k')P(k), (19) 

where X(k) is the Fourier transform of the chain, Xi. In practice, the power spectrum is estimated from a finite chain 
in the following way 

P(k) = \a k \ 2 , (20) 

in which a k are the discrete (inverse) Fourier transform of the chain divided by the square root of the number of 
samples in the chain, N. And k — 2irj/N for j = 1, • • ■ , N/2 — 1. An ideal sampler will have a flat power spectrum. 
While actual MCMC chains have correlations on small scales and therefore their power spectrum bends on small 
scales |l3|. The P(k) at k = is an estimate of the sample mean variance, er?(A), 

°*(*0 = 5> Po=P(k = 0). (21) 



Efficiency: The statistical efficiency of an MCMC sequence is defined as the ratio of the number of independent 
draws from the target pdf to the number of MCMC iterations required to achieve the same variance in an estimated 
quantity. The efficiency is defined as 

E = lira 4r^ (22) 

AT^oo a? (AO V ' 

where ctq is the variance of the underlying distribution and a s is the variance of the sample mean from the chain. 
E" 1 is an estimate of the factor by which the MCMC chain is longer than an ideal chain. Comparing can. (|22|l with 
ean. (|21|l . we see that the efficiency of an MCMC chain is given by Po 

E= a l (23) 
For more diagnostics see references [27]], and references there in. 



IV. APPLICATIONS OF HMC 



We start from the simplest distribution, D-dimensional Gaussian distribution 

E(x) = ^x T C- 1 x, Cy = (fiSij (24) 

with the gradient 

AB(x) = C^x (25) 

A 6-dimensional Gaussian distribution (as given above) with ao = lis sampled with a chain of length N samp i es = 8192 
using two methods: the Hamiltonian Monte Carlo method with N — 100 leapfrog steps taken at each iteration at a 
step-size chosen such that Ne = ao, and a Metropolis algorithm that uses a Gaussian proposal distribution with a 
step-size ctt/^o ~ 2.4/ ^/D. 
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FIG. 1: Samples drawn from an isotropic six-dimensional Gaussian distribution using the HMC (top) and the Metropolis 
algorithm with optimal step-size (bottom). 



Fig. H shows the chains generated by the two algorithms. These chains are used to compare the two methods. 
The acceptance rate for the HMC is 99% while it is 25% for the MCMC. HMC improves the mixing by accepting 
independent samples 4 times more than the traditional MCMC chain. Correlations between the successive steps in 
these methods are shown in Fig. The small panel shows the autocorrelations: the HMC autocorrelation dies off 
much faster than that of the MCMC. This feature is seen much better in the power spectrum of the chains. The HMC 
takes a smaller time to enter the white-noise regime at large scales than MCMC. This is given by the scale on which 
the power spectrum turns over and becomes flat which is around k — 0.1 for the HMC and k = 0.02 for the MCMC. 
The efficiency of the chains can be read directly from the power spectra of Fig. [5] The efficiency is given by ao /Pq 
and since ctq = 1 , it is simply equal to 1/ Pq. The ratio of efficiencies are 



Ehmc P( 



MCMC 
E M CMC ~ Pn HMC 



6. 



(26) 



Therefore in 6 dimensions an MCMC chain is 6 times longer than an HMC chain yielding the same performance. 
Later we will show that this number scales as the number of dimensions and hence the HMC does better than the 
MCMC on higher dimensions. 

This can be seen more clearly in Fig. |3J where histograms of marginal distributions and two-dimensional contour 
plots are plotted for the both chains. For the same length of chains, the HMC gives more accurate results. 



A. Efficiency of the HMC 



The exercise above can be repeated for other dimensions to see how the efficiency of the HMC chains arc scales 
with the number of dimensions of the problem of interest. This can be done in two ways 
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FIG. 2: Power spectrum, P(k), for the Hamiltonian Monte Carlo (solid red line) compared to that of a traditional MCMC 
method with an optimal step-size (dashed blue line). 



1. by comparing the sample mean variance to an ideal chain; ean. H22|) . 

2. by computing the power spectrum and deriving the P(k = 0); ean. H23|) . 

We computed the efficiency of the HMC using the first method and cross checked it with the second method to confirm 
our results. The analysis are done for the HMC algorithm with Ne = 1. The efficiency is plotted versus the number 
of dimensions in Figure 01 and shows that it remains constant even in high dimensions. 

As a comparison, the optimal efficiency of the MCMC method is inversely proportional to the number of dimensions; 
E w g-ijy 13]. And hence the ratio of the two efficiencies is proportional to the number of dimensions 

Ehmc . D (2?) 



Emcmc 

Therefore in _D-dimensions an HMC chain is D times shorter than an MCMC chain yielding the same performance 



B. Non-Gaussian And Curved Distributions 



In 2 or more dimensions, the distribution of our interest may be very complicated. Sampling from non-Gaussian or 
curved distributions can be difficult and inefficient. In these cases usually a re-parametrization of the problem helps 
a lot to transform the distribution to a relatively simpler one which is easier to sample from |29j . The HMC can 
im pro ve the efficiency of sampling from these distributions. To demonstrate this, we choose the worst case scenario 
of [l3J; a thin, curved, non-Gaussian distribution given by 



{x 2 + y 2 -l) 2 j? 
8a 2 2a 2 



E= ^ x ' (28) 
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FIG. 3: Marginalized distributions and two-dimensional contour plots of the 6-dimensional Gaussian model sampled with chains 
of length Nsampiea = 8192 using the HMC (left) and MCMC (right). 
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FIG. 4: Average efficiency of the HMC vs. the number of dimensions. 
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FIG. 5: left: The first 1000 samples drawn from the crescent-like distribution of eqn. 1281 and their power spectra, P(k) {right). 



Sampling the above distribution in the present form is very inefficient with the traditional MCMC methods In 
contrast, the HMC can easily sample that. Figure |3] shows the first 1000 samples drawn from the above distribution 
and their power spectra. The widths are chosen such that the dimensionless parameter cf/cri = 5. As it can be seen 
in the figure, the chains have reached the equillibrium shortly and are sampling from the desired distribution. This is 
an idealized case where we exactly know the gradient at every point in the parameter space. However in a real- world 
problem we don't know the gradients very accurately. We will discuss this in the next section and by applying this 
method to cosmological parameter estimation we will show that even a simple guess of the gradients will work. But 
better guesses will result in more efficient estimations. 



V. COSMOLOGICAL PARAMETER ESTIMATION WITH HAMILTONIAN MONTE CARLO 

The HMC is shown to be very promising in reducing the length of Markov chains for a reliable inference. Hence, 
this method can potentially speed up the cosmological parameter estimation by reducing the number of samples of a 
Markov chain. Taking a closer look at a typical algorithm for parameter estimation we will see the bottle necks of 
these algorithms are 

1. Computing the power spectrum: at each step the CMB power spectrum should be computed for the new set of 
parameters. As the new experiments push to the larger I, computing the power spectrum becomes more time 
consuming. 

2. Computing the likelihood: at each step, given the power spectrum, the likelihood should be computed. This is 
not as slow as the power spectrum computation, but becomes a bottle neck in long chains. 

3. Number of samples needed in Markov chains: usually long Markov chains must be generated to achieve reliable 
results. Two practical issues of the MCMC chains are high correlations and low efficiency that result in long 
burn in times and high rejection rate. Specially in high dimensions, very lengthy Markov chains are needed (the 
'curse of dimensionality'). 

The first two problems have attracted a lot of attention recently and fast methods of computing the power spectrum 
and likelihood have been designed and All of these methods try to speed up the parameter estimation 
by making each step taken in a shorter time, but due to the nature of the traditional MCMC methods, still long 
chains are needed to obtain reliable results. An orthogonal and complimentary approach is to reduce the length of 
the Markov chains. This can be done by the HMC method. HMC can be done in ordinary cosmological parameter 
estimation codes. The method is simple and it is straightforward to add a module to a standard cosmological 
parameter estimation package, like CosmoMC, to perform that. The HMC algorithm needs to compute the gradient 
of the minus logarithm of the likelihood in each step. This can be done using an auxiliary distribution that mimics 
the minus logarithm of the likelihood function but is fast to compute. The leapfrogs can then be taken with no major 
additional cost of time. 



11 



30 



20 
10 

1 ° 

a 

-10 
-20 



-30 

10 20 30 40 50 60 70 80 

Sequence Number 

FIG. 6: Gradients of the minus logarithm of likelihood computed at 85 random positions using a Gaussian approximation to 
the likelihood and modified Lico. 

A. Computing The Gradients: Zeroth Order Approximation 

The simplest choice for an auxiliary function to estimate the gradients of the minus logarithm of the likelihood is 

^— - = (x - xjtC-^x - x), (29) 

and the gradient would be 

H = *c (c./l-r, - Xj) + (xj x^Cjl 1 ) (30) 
= 2a c C^ 1 (xj - Xj). 

As an example we sample from the 6-parameter flat LCDM model using the HMC. An exploratory run of traditional 
MCMC can be performed to obtain the fitting parameters ac, £ and x. Using the simple MCMC option of CosmoMC 
(sampling_method=l option is to use the default Metropolis algorithm in the optimal fast/slow subspaces) at the 
exploratory phase the following fit to the minus logarithm of the likelihood is found 

C ~ (x-xJ+C-^x-XJojc+Z (31) 
= 0.5164 (x - x)t C~ x (x-x) + 5626, 

where C%j is the covariance matrix and x are the means, all estimated from the exploratory chains. These are all 
we need, the gradient can then be estimated using eqn. (|30|l . Running modified CosmoMC with the above sampling 
method, yields 81% acceptance rate. Proposed steps are rejected 19% of times because of the imperfect estimate of 
the gradient of the field which causes inaccurate simulation of Hamiltonian dynamics. Even this simplest choice of 
the auxiliary function improves the acceptance rate by more than a factor of 3 and reduces the correlations which in 
turn boosts up the efficiency of the chain and reduces the burn in time. However, this can be further improved to 
achieve a close-to-ideal sampler. 




B. Computing The Gradients: Using Lico 



A good feature of the HMC is that we are not limited to Gaussian auxiliary functions to estimate the gradients from. 
Hence we can choose functions that better resemble the minus log likelihood function. For example a polynomial fit 
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FIG. 7: Binned power spectra, P(k), of the Markov chains sampled from the 6-parameter flat LCDM model using the default 
Metropolis algorithm of CosmoMC (dashed blue line) and the HMC (solid red line). The ratio of efficiencies are given by the 
inverse ratio of P(0) for each method. The HMC is almost a factor of 6 more efficient than the default Metropolis algorithm 
of CosmoMC. Autocorrelation functions for Q.bh 2 chains are also plotted in the small panel for comparison. 



can be made in the following form 

£—£ xi-xi ,x 2 -x 2 , .xi—xi xi-x 1 

= c + ci( ) + c 2 ( ) + c 3 ( )( ) (32) 

ac 0\ (72 Ci (Ti 

Xi - Zi x 2 - X 2 . ,x 2 ~x 2 x 2 ~x 2 

+ c 4 ( )( ) + c 5 ( )( ) + • • • • 

a i a 2 a 2 a 2 

The coefficients of the above equation are found by Lico, the likelihood routine of the Pico package 0. Lico breaks 
the parameter space into 30 regions and performs a fourth order polynomial fit to the £ in each region separately. 
It is possible to modify Lico to compute the gradient of the above equation. This method must give a much better 
estimate of the gradients comparing to the simplest one. 

Fig. El shows the differences between the gradients computed using the two methods. Comparison is done at 100 
random points in the parameter space. 

To use this method in the HMC, we have added a module to CosmoMC that allows CosmoMC to use the above 
method based on Lico to compute the gradient of C whenever the parameters are within the range over which Picos 
regression coefficients are defined. For parameters outside this range, CosmoMC will continue to use the Gaussian 
approximation to compute the gradient. 



C. A Comparison between the HMC and the Monte Carlo method of CosmoMC 

Having the HMC equipped with the above method of gradient estimation, we sample the 6-parameter flat LCDM 
model with an HMC sampler. The leapfrog step-sizes are the chosen to be e = 0.01<7i, where Cj = y/Cu. We randomly 
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Q, b h 2 




e 


T 


n s 


As 


Average 


MCMC 


11.5 


19.6 


11.7 


28.9 


17.8 


13.5 


17.1 


HMC 


3.2 


2.8 


4.0 


2.9 


3.4 


3.8 


3.3 



TABLE I: Autocorrelation lengths of the sampled chains from the 6-parameter flat LCDM model using the HMC and the 
Metropolis method of CosmoMC. 

draw the number of leapfrog steps taken at each iteration from a uniform distribution such that N e ~ Uniform(0, 3)0^. 
We run the chain to generate 16000 samples. This time we get 98% acceptance rate, and very low correlations in the 
chain. We also run CosmoMC using the simple Monte Carlo option (with the optimal step-size ot = 2Aai/\/~D) to 
generate the same number of samples. The acceptance rate is 35%. The autocorrelation lengths of the HMC chain 
for the 6-parameter fiat LCDM model are compared to those of the Metropolis method of CosmoMC in Table [I] The 
average autocorrelation length in the HMC chains are smaller than those of MCMC chains by a factor of ~ 6. This 
can be seen better in the power spectrum plots of Fig. Autocorrelation functions are also plotted in the small panel 
for comparison. The efficiency of the HMC is almost an order of magnitude better than that of the default Metropolis 
method of CosmoMC. Therefore, using the HMC, one needs a much shorter chain to obtain the same accuracy as the 
Metropolis method. 




FIG. 8: Marginalized distributions of the 6-parameter flat LCDM model obtained from the HMC. 



VI. DISCUSSION AND CONCLUSIONS 

With the future data of the upcoming experiments like ACT [6j and Planck [jj that will map the CMB sky on small 
scales (large I), the need to efficient and accurate methods of parameter estimation will be more. Besides finding 
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methods of speeding up power spectrum and likelihood calculation, it is useful to design efficient Markov Chain Monte 
Carlo samplers to do reliable estimations with shorter chains. 

We introduce the Hamiltonian Monte Carlo method (HMC) to improve the efficiency of the MCMC methods used 
in Bayesian parameter estimation in Astrophysics. This technique is based on Hamiltonian dynamics instead of the 
random walk. Hamiltonian dynamics allows the chain to move along trajectories of constant energy, taking large 
jumps in the parameter space with relatively inexpensive computations. This improves the acceptance rate and at 
the same time decreases the correlations among the chains. Unlike the traditional MCMC methods, the efficiency of 
the HMC remains constant even at large dimensions. Therefor shorter chains will be needed for a reliable parameter 
estimation comparing to a traditional MCMC chain yielding the same performance. Besides that, the HMC is well 
suited to sample curved, multi-modal and non-Gaussian distributions which are very hard to sample from using the 
traditional MCMC methods. 

In addition to idealized toy models, we have applied this technique to cosmological parameter estimation. Our 
results have been compared to the outputs of the standard package of CosmoMC. For the 6-parameter flat LCDM 
model, the HMC is almost a factor of 10 more efficient than the default MCMC method of CosmoMC. Therefore, using 
the HMC, one needs a much shorter chain to obtain the same accuracy compared to the Metropolis method. 

In addition to the high efficiency, another important feature of the HMC is that we don't have to compromise on 
the accuracy of our calculations. No approximation is made and the HMC sampler faithfully samples from the target 
distribution. It is shown in this paper that even a rough estimate of the gradient of the logarithm of likelihood can 
be used to make the HMC work. But the better the gradients are estimated the higher the acceptance rate will be. 
Combining this with methods of speeding up power spectrum and likelihood calculation such as Pico will make the 
Bayesian parameter estimation unbelievably fast. 
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